128 research outputs found

    Incorporating Prior Knowledge in Deep Learning Models via Pathway Activity Autoencoders

    Motivation: Despite advances in the computational analysis of high-throughput molecular profiling assays (e.g. transcriptomics), a dichotomy exists between methods that are simple and interpretable, and ones that are complex but less interpretable. Furthermore, very few methods attempt to translate interpretability into biologically relevant terms, such as known pathway cascades. Biological pathways reflecting signalling events or metabolic conversions are [...] Determining which pathways are implicated in disease and incorporating such pathway data as prior knowledge may enhance predictive modelling and personalised strategies for diagnosis, treatment and prevention of disease. Results: We propose a novel prior-knowledge-based deep auto-encoding framework, PAAE, together with its accompanying generative variant, PAVAE, for RNA-seq data in cancer. Through comprehensive comparisons among various learning models, we show that, despite having access to a smaller set of features, our PAAE and PAVAE models achieve better out-of-set reconstruction results compared to common methodologies. Furthermore, we compare our models with equivalent baselines on a classification task and show that they achieve better results than models which have access to the full input gene set. We also find that using vanilla variational frameworks might negatively impact both reconstruction outputs and classification performance. Finally, our work provides comprehensive interpretability analyses of our models, in addition to improving prognostication for translational medicine.

    Module detection in complex networks using integer optimisation

    Background: The detection of modules or community structure is widely used to reveal the underlying properties of complex networks in biology, as well as physical and social sciences. Since the adoption of modularity as a measure of network topological properties, several methodologies for the discovery of community structure based on modularity maximisation have been developed. However, satisfactory partitions of large graphs with modest computational resources are particularly challenging due to the NP-hard nature of the related optimisation problem. Furthermore, it has been suggested that optimising the modularity metric can reach a resolution limit whereby the algorithm fails to detect smaller communities than a specific size in large networks. Results: We present a novel solution approach to identify community structure in large complex networks and address resolution limitations in module detection. The proposed algorithm employs modularity to express network community structure and it is based on mixed integer optimisation models. The solution procedure is extended through an iterative procedure to diminish effects that tend to agglomerate smaller modules (resolution limitations). Conclusions: A comprehensive comparative analysis of methodologies for module detection based on modularity maximisation shows that our approach outperforms previously reported methods. Furthermore, in contrast to previous reports, we propose a strategy to handle resolution limitations in modularity maximisation. Overall, we illustrate ways to improve existing methodologies for community structure identification so as to increase its efficiency and applicability.
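
    As a rough illustration of the modularity measure this abstract refers to, the sketch below evaluates the standard Newman-Girvan modularity of a given partition of an undirected, unweighted graph. It only scores a partition; it is not the authors' mixed integer optimisation model, and the function name `modularity` and the toy graph are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a partition.

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    community: dict mapping every node to a community label.
    Uses Q = sum_c [ L_c / m - (d_c / (2m))^2 ], where L_c is the number of
    edges inside community c, d_c the total degree of its nodes, and m the
    number of edges in the graph.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    internal = defaultdict(int)    # L_c
    degree_sum = defaultdict(int)  # d_c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph (illustrative only): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "all" for node in range(6)}

print(modularity(edges, two_modules))
print(modularity(edges, one_module))
```

    On this toy graph the two-triangle partition scores higher than the single-module partition. A modularity-maximising method searches over partitions for the highest such score.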

    CytoASP: a Cytoscape app for qualitative consistency reasoning, prediction and repair in biological networks

    Background: Qualitative reasoning frameworks, such as the Sign Consistency Model (SCM), enable modelling regulatory networks to check whether observed behaviour can be explained or if unobserved behaviour can be predicted. The BioASP software collection offers ideal tools for such analyses. Additionally, the Cytoscape platform can offer extensive functionality and visualisation capabilities. However, specialist programming knowledge is required to use BioASP and no methods exist to integrate both of these software platforms effectively. Results: We report the implementation of CytoASP, an app that allows the use of BioASP for influence graph consistency checking, prediction and repair operations through Cytoscape. While offering inherent benefits over traditional approaches using BioASP, it provides additional advantages such as customised visualisation of predictions and repairs, as well as the ability to analyse multiple networks in parallel, exploiting multi-core architecture. We demonstrate its usage in a case study of a yeast genetic network, and highlight its capabilities in reasoning over regulatory networks. Conclusion: We have presented a user-friendly Cytoscape app for the analysis of regulatory networks using BioASP. It allows easy integration of qualitative modelling, combining the functionality of BioASP with the visualisation and processing capability in Cytoscape, and thereby greatly simplifying qualitative network modelling, promoting its use in relevant projects
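
    The consistency check mentioned in this abstract can be illustrated with a simplified sketch. It assumes the usual sign consistency rule: the observed change of a non-input node must be explained by at least one incoming influence, meaning that the source sign multiplied by the edge sign equals the target sign. It also assumes that every node is labelled and that nodes without incoming edges are inputs. The function `inconsistent_nodes` and the three-node network are hypothetical. BioASP itself relies on answer set programming and also handles partial labellings, prediction and repair, none of which is attempted here.

```python
def inconsistent_nodes(edges, labels):
    """Return the nodes whose observed sign no incoming influence explains.

    edges: list of (source, target, sign), with sign +1 (activation)
           or -1 (inhibition).
    labels: dict mapping every node to +1 (increase) or -1 (decrease).
    """
    incoming = {}
    for source, target, sign in edges:
        incoming.setdefault(target, []).append((source, sign))

    unexplained = []
    for node, label in labels.items():
        influences = incoming.get(node)
        if not influences:
            continue  # assumption: nodes without predecessors are inputs
        if not any(labels[src] * sign == label for src, sign in influences):
            unexplained.append(node)
    return sorted(unexplained)


# Hypothetical influence graph: a activates b, a inhibits c, b activates c.
network = [("a", "b", +1), ("a", "c", -1), ("b", "c", +1)]

consistent = {"a": +1, "b": +1, "c": -1}
conflicting = {"a": +1, "b": -1, "c": -1}

print(inconsistent_nodes(network, consistent))
print(inconsistent_nodes(network, conflicting))
```

    In the second labelling, node b decreases although its only regulator a increases and activates it, so b is reported as unexplained.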

    Optimal Piecewise Linear Regression Algorithm for QSAR Modelling

    Quantitative Structure‐Activity Relationship (QSAR) models have been successfully applied to lead optimisation, virtual screening and other areas of drug discovery over the years. Recent studies, however, have focused on the development of models that are predictive but often not interpretable. In this article, we propose the application of a piecewise linear regression algorithm, OPLRAreg, to develop both predictive and interpretable QSAR models. The algorithm determines a feature to best separate the data into regions and identifies linear equations to predict the outcome variable in each region. A regularisation term is introduced to prevent overfitting problems and implicitly selects the most informative features. As OPLRAreg is based on mathematical programming, a flexible and transparent representation for optimisation problems, the algorithm also permits customised constraints to be easily added to the model. The proposed algorithm is presented as a more interpretable alternative to other commonly used machine learning algorithms and has shown comparable predictive accuracy to Random Forest, Support Vector Machine and Random Generalised Linear Model on tests with five QSAR data sets compiled from the ChEMBL database
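
    The idea of separating the data into regions along one feature and fitting a linear equation in each region can be sketched as follows. This is a brute-force illustration with a single feature, two regions and ordinary least squares. It is not OPLRAreg, which is formulated as a mathematical programming model and includes a regularisation term. The names `fit_line` and `best_two_region_fit` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; also returns the SSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = 0.0 if sxx == 0 else sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def best_two_region_fit(xs, ys, min_points=2):
    """Try every split of the sorted data and keep the one with the lowest total SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_points, len(pairs) - min_points + 1):
        left, right = pairs[:i], pairs[i:]
        if left[-1][0] == right[0][0]:
            continue  # no breakpoint can separate identical feature values
        left_fit = fit_line(*zip(*left))
        right_fit = fit_line(*zip(*right))
        total = left_fit[2] + right_fit[2]
        if best is None or total < best["sse"]:
            best = {
                "breakpoint": (left[-1][0] + right[0][0]) / 2,
                "left": left_fit[:2],    # (slope, intercept) below the breakpoint
                "right": right_fit[:2],  # (slope, intercept) above the breakpoint
                "sse": total,
            }
    return best


# Toy data: y = x for x <= 2, then y = 10 - 2x for x >= 3.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 2, 4, 2, 0]
result = best_two_region_fit(xs, ys)
print(result)
```

    Each region keeps its own slope and intercept, which is what makes a piecewise linear model straightforward to read: a prediction is traced to one breakpoint test and one linear equation.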

    Optimisation Models for Pathway Activity Inference in Cancer

    BACKGROUND: With advances in high-throughput technologies, there has been an enormous increase in data related to profiling the activity of molecules in disease. While such data provide more comprehensive information on cellular actions, their large volume and complexity pose difficulty in accurate classification of disease phenotypes. Therefore, novel modelling methods that can improve accuracy while offering interpretable means of analysis are required. Biological pathways can be used to incorporate a priori knowledge of biological interactions to decrease data dimensionality and increase the biological interpretability of machine learning models. METHODOLOGY: A mathematical optimisation model is proposed for pathway activity inference towards precise disease phenotype prediction and is applied to RNA-Seq datasets. The model is based on mixed-integer linear programming (MILP) mathematical optimisation principles and infers pathway activity as the linear combination of pathway member gene expression, multiplying expression values with model-determined gene weights that are optimised to maximise discrimination of phenotype classes and minimise incorrect sample allocation. RESULTS: The model is evaluated on the transcriptome of breast and colorectal cancer, and exhibits solution results of good optimality as well as good prediction performance on related cancer subtypes. Two baseline pathway activity inference methods and three advanced methods are used for comparison. Sample prediction accuracy, robustness against noise expression data, and survival analysis suggest competitive prediction performance of our model while providing interpretability and insight on key pathways and genes. Overall, our work demonstrates that the flexible nature of mathematical programming lends itself well to developing efficient computational strategies for pathway activity inference and disease subtype prediction
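
    The pathway activity described in this abstract is a weighted linear combination of the expression of the pathway's member genes. The minimal sketch below shows only that scoring step. The gene names, expression values and weights are made-up placeholders; in the paper the weights are determined by the MILP model, which is not reproduced here. The function name `pathway_activity` is hypothetical.

```python
def pathway_activity(expression, weights):
    """Weighted sum of member-gene expression for one sample.

    expression: dict mapping gene -> expression value for the sample.
    weights: dict mapping each pathway member gene -> weight.
    """
    missing = [gene for gene in weights if gene not in expression]
    if missing:
        raise KeyError(f"no expression value for pathway genes: {missing}")
    return sum(weight * expression[gene] for gene, weight in weights.items())


# Placeholder values for illustration only (not taken from any dataset).
sample = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 4.0}
pathway_weights = {"GENE_A": 0.5, "GENE_B": -1.0}  # GENE_C is not a member

activity = pathway_activity(sample, pathway_weights)
print(activity)
```

    Computing one such score per pathway and per sample replaces thousands of gene-level features with a much smaller set of pathway-level features, which is the dimensionality reduction the abstract describes.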

    Functional and Topological Properties in Hepatocellular Carcinoma Transcriptome

    Hepatocellular carcinoma (HCC) is a leading cause of global cancer mortality. However, little is known about the precise molecular mechanisms involved in tumor formation and pathogenesis. The primary goal of this study was to elucidate genome-wide molecular networks involved in development of HCC with multiple etiologies by exploring high quality microarray data. We undertook a comparative network analysis across 264 human microarray profiles monitoring transcript changes in healthy liver, liver cirrhosis, and HCC with viral and alcoholic etiologies. Gene co-expression profiling was used to derive a consensus gene relevance network of HCC progression that consisted of 798 genes and 2,012 links. The HCC interactome was further confirmed to be phenotype-specific and non-random. Additionally, we confirmed that co-expressed genes are more likely to share biological function, but not sub-cellular localization. Analysis of individual HCC genes revealed that they are topologically central in a human protein-protein interaction network. We used quantitative RT-PCR in a cohort of normal liver tissue (n = 8), hepatitis C virus (HCV)-induced chronic liver disease (n = 9), and HCC (n = 7) to validate co-expressions of several well-connected genes, namely ASPM, CDKN3, NEK2, RACGAP1, and TOP2A. We show that HCC is a heterogeneous disorder, underpinned by complex cross talk between immune response, cell cycle, and mRNA translation pathways. Our work provides a systems-wide resource for deeper understanding of molecular mechanisms in HCC progression and may be used further to define novel targets for efficient treatment or diagnosis of this disease

    Expansion of the BioCyc collection of pathway/genome databases to 160 genomes

    The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists can we hope to produce biological databases that accurately reflect the depth and breadth of knowledge that the biomedical research community is producing.

    Genome sequences and great expectations

    To assess how automatic function assignment will contribute to genome annotation in the next five years, we have performed an analysis of 31 available genome sequences. An emerging pattern is that function can be predicted for almost two-thirds of the 73,500 genes that were analyzed. Despite progress in computational biology, there will always be a great need for large-scale experimental determination of protein function